Graph Classification using Signal-Subgraphs: 
Applications in Statistical Connectomics 

Joshua T. Vogelstein, William R. Gray, R. Jacob Vogelstein, and Carey E. Priebe 

Abstract — This manuscript considers the following "graph classification" question: given a collection of graphs and associated classes, 
how can one predict the class of a newly observed graph? To address this question we propose a statistical model for graph/class 
pairs. This model naturally leads to a set of estimators to identify the class-conditional signal, or "signal-subgraph," defined as the 
collection of edges that are probabilistically different between the classes. The estimators admit classifiers which are asymptotically 
optimal and efficient, but differ by their assumption about the "coherency" of the signal-subgraph (coherency is the extent to which the 
signal-edges "stick together" around a common subset of vertices). Via simulation, the best estimator is shown to be not just a function 
of the coherency of the model, but also the number of training samples. These estimators are employed to address a contemporary 
neuroscience question: can we classify "connectomes" (brain-graphs) according to sex? The answer is yes, and significantly better than 
all benchmark algorithms considered. Synthetic data analysis demonstrates that even when the model is correct, given the relatively 
small number of training samples, the estimated signal-subgraph should be taken with a grain of salt. We conclude by discussing 
several possible extensions. 

Index Terms — statistical inference, graph theory, network theory, structural pattern recognition, connectome, classification. 
1 Introduction 

GRAPHS are emerging as a prevalent form of data 
representation in fields ranging from optical character 
recognition and chemistry [1] to neuroscience [2]. 
While statistical inference techniques for vector-valued 
data are widespread, statistical tools for the analysis of 
graph-valued data are relatively rare [1]. In this work 
we consider the task of labeled graph classification: given 
a collection of labeled graphs and their corresponding 
classes, can we accurately infer the class for a new 
graph? Note that we assume throughout that each vertex 
has a unique label, and that all graphs have the same 
number of vertices with the same vertex labels. 

We propose and analyze a joint graph/class model — 
sufficiently simple to characterize its asymptotic prop- 
erties, and sufficiently rich to afford useful empirical 
applications. This model admits a class-conditional sig- 
nal encoded in a subset of edges, the signal-subgraph. 
Finding the signal-subgraph amounts to providing an 
understanding of the differences between the two graph 
classes. Moreover, borrowing a term from the compressive 
sensing literature [3], [4], we are interested in 
learning to what extent this signal is coherent; that is, 
to what extent the signal-subgraph edges are incident to 
a relatively small set of vertices. In other words, if the 
signal is sparse in the edges, then the signal-subgraph 
is incoherent; if it is also sparse in the vertices, then the 
signal-subgraph is coherent (we formally define these 
notions below). 

This graph-model based approach is qualitatively dif- 
ferent from most previous approaches which utilize 
only unique vertex labels or graph structure. In the 
former case, simply representing the adjacency matrix 
with a vector and applying standard machine learning 
techniques ignores graph structure (for instance, it is 
not clear how to implement a coherent signal-subgraph 
estimator in this representation). In the latter case, computing 
a set of graph invariants (such as clustering coefficient), 
and then classifying using only these invariants 
ignores vertex labels [1], [5], [6]. 

While some of the above approaches consider at- 
tributed vertices or edges, we are unable to find any 
that utilize both unique vertex labels and graph structure. 
The field of connectomics (the study of brain-graphs), 
however, is rife with examples of brain-graphs 
with vertex labels. In invertebrate brain-graphs, for 
example, often each neuron is named, such that one 
can compare neurons across individuals of the same 
species [7]. In vertebrate neurobiology, while neurons 
are rarely named, "neuron types" [8] and neuroanatomical 
regions [9] are named. Moreover, a widely held 
view is that many psychiatric issues are fundamentally 
"connectopathies" [10], [11]. For prognostic and diagnostic 
purposes, merely being able to differentiate groups of 
brain-graphs from one another is sufficient. However, 
for treatment, it is desirable to know which vertices 
and/or edges are malfunctioning, such that therapy can 
be targeted to those locations. This is the motivating 
application for our work. 

We demonstrate via theory, simulation, analysis of a 
neurobiological data set (magnetic resonance based connectome 
sex classification), and synthetic data analysis, 
that utilizing graph structure can significantly enhance 
classification accuracy. However, the best approach for 
any particular data set is not just a function of the 
model, but also the amount of data. Moreover, even 
when the model is true, given a relatively small sample 
size, the estimated signal-subgraph will often overlap 
with the truth, but not fully capture it. Nonetheless, the 
classifiers described below still significantly outperform 
the benchmarks. 

2 Methods 

2.1 Setting 

Let $G : \Omega \to \mathcal{G}$ be a graph-valued random variable with 
samples $G_i$. Each graph $G = (\mathcal{V}, E)$ is defined by a set 
of $V$ vertices, $\mathcal{V} = \{v_i\}_{i \in [V]}$, where $[V] = \{1, \ldots, V\}$, and 
a set of edges between pairs of vertices $E \subseteq \mathcal{V} \times \mathcal{V}$. 
Let $A : \Omega \to \mathcal{A}$ be an adjacency matrix-valued random 
variable taking values $a \in \mathcal{A} \subseteq \mathbb{R}^{V \times V}$, identifying which 
vertices share an edge. Let $Y : \Omega \to \mathcal{Y}$ be a discrete-valued 
random variable with samples $y_i$. Assume the 
existence of a collection of $n$ exchangeable samples of 
graphs and their corresponding classes from some true 
but unknown joint distribution: $\{(G_i, Y_i)\}_{i \in [n]} \overset{exch.}{\sim} F_{G,Y}$. 
Our aim (exploitation task) is to build a graph classifier 
that can take a new graph, $G$, and correctly estimate its 
class, $y$, assuming that they are jointly sampled from 
some distribution, $F_{G,Y}$. Moreover, we are interested 
solely in graph classifiers that are interpretable with respect 
to the vertices and edges of the graph. In other 
words, nonlinear manifold learning, feature extraction, 
and related approaches are unacceptable. 

We adopt the common practice of identifying graphs 
with their adjacency matrices. We note, however, that op- 
erations available on the latter (addition, multiplication) 
are not intrinsic to the former. 

2.2 Model 

Consider the model, $\mathcal{F}_{G,Y}$, which includes all joint distributions 
over graphs and classes under consideration: 
$\mathcal{F}_{G,Y} = \{F_{G,Y}(\cdot; \theta) : \theta \in \Theta\}$, where $\theta \in \Theta$ indexes 
the distributions. We proceed via a hybrid generative-discriminative 
approach [12] whereby we describe a generative 
model and place constraints on the discriminant 
boundary. 

First, assume that each graph has the same set of 
uniquely labeled vertices, so that all the variability in 
the graphs is in the adjacency matrix, which implies that 
$F_{G,Y} = F_{A,Y}$. Second, assume edges are independent; 
that is, $F_{A,Y} = \prod_{(u,v) \in \mathcal{E}} F_{A_{uv},Y}$, where $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ is the set 
of all possible edges. Now, consider the generative decomposition 
$F_{A,Y} = F_{A|Y} F_Y$, and let $F_{uv|y} = F_{A_{uv}|Y=y}$ 
and $\pi_y = F_{Y=y}$. Third, assume the existence of a class-conditional 
difference; that is, $F_{uv|0} \neq F_{uv|1}$ for some 
$(u,v) \in \mathcal{E}$, and denote the edges satisfying this condition 
the signal-subgraph, $\mathcal{S} = \{(u,v) \in \mathcal{E} : F_{uv|0} \neq F_{uv|1}\}$. 



Fourth, although the following theory and algorithms 
are valid for both directed and undirected graphs, for 
concreteness, assume that the graphs are simple graphs; 
that is, undirected, with binary edges, and lacking (self-)loops 
(so $\mathcal{E} = \binom{\mathcal{V}}{2}$). Thus, the likelihood of an 
edge between vertex $u$ and $v$ is given by a Bernoulli 
random variable with a scalar probability parameter: 
$F_{uv|y}(A_{uv}) = \mathrm{Bern}(A_{uv}; p_{uv|y})$. Together, these four assumptions 
imply the following model: 

$\mathcal{F}_{G,Y} = \{F_{A,Y}(a, y; \theta) \ \forall a \in \mathcal{A}, y \in \mathcal{Y} : \theta \in \Theta\}$, (1) 

where 

$F_{A,Y}(a, y; \theta) = \prod_{(u,v) \in \mathcal{S}} \mathrm{Bern}(a_{uv}; p_{uv|y}) \, \pi_y \times \prod_{(u,v) \in \mathcal{E} \setminus \mathcal{S}} \mathrm{Bern}(a_{uv}; p_{uv})$, (2) 

and $\theta = \{p, \pi, \mathcal{S}\}$. The likelihood parameter is constrained 
such that each element must be between zero 
and one: $p \in (0,1)^{\binom{V}{2} \times |\mathcal{Y}|}$. The prior parameter, $\pi = 
(\pi_1, \ldots, \pi_{|\mathcal{Y}|})$, must have elements greater than or equal 
to zero and sum to one: $\pi_y \ge 0$, $\sum_y \pi_y = 1$. The signal-subgraph 
parameter is a non-empty subset of the set of 
possible edges, $\mathcal{S} \subseteq \mathcal{E}$ and $\mathcal{S} \neq \emptyset$. 

We consider up to two additional constraints on $\mathcal{S}$. 
First, the size of the signal-subgraph may be constrained 
such that $|\mathcal{S}| \le s$. Second, the number of 
vertices to which the collection of edges is incident 
may be constrained such that $\mathcal{S} = \{(u,v) : u \in \mathcal{U} \text{ or } v \in \mathcal{U}\}$, 
where $\mathcal{U}$ is a set of signal-vertices with $|\mathcal{U}| \le m$. Edges 
in the signal-subgraph are called signal-edges. Note that 
given a collection of signal-edges, the signal-vertex set 
may not be unique. While it may be natural to treat $\mathcal{S}$ 
as a prior, we treat it as a parameter of the model; the 
constraints, $s$ and $m$, are considered hyper-parameters. 

Note that given a specification of the class-conditional 
likelihood of each edge and class-prior, one completely 
defines a joint distribution over graphs and classes; 
the signal-subgraph is implicit in that parameterization. 
However, the likelihood parameters for all edges not 
in the signal-subgraph, $p_{uv|y} = p_{uv} \ \forall y \in \mathcal{Y}, (u,v) \notin \mathcal{S}$, 
are nuisance parameters; that is, they contain no class-conditional 
signal. When computing a relative posterior 
class estimate, these nuisance parameters cancel in the 
ratio. 
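
The cancellation can be written out in one step. The following is a worked instance of Eq. (2) for the two-class case, assuming the classes are labeled 0 and 1 as in the contingency table later in the text; it is a sketch of the argument, not an equation from the original manuscript.

```latex
\frac{F_{A,Y}(a, 1; \theta)}{F_{A,Y}(a, 0; \theta)}
  = \frac{\pi_1 \prod_{(u,v) \in \mathcal{S}} \mathrm{Bern}(a_{uv}; p_{uv|1})
          \prod_{(u,v) \in \mathcal{E} \setminus \mathcal{S}} \mathrm{Bern}(a_{uv}; p_{uv})}
         {\pi_0 \prod_{(u,v) \in \mathcal{S}} \mathrm{Bern}(a_{uv}; p_{uv|0})
          \prod_{(u,v) \in \mathcal{E} \setminus \mathcal{S}} \mathrm{Bern}(a_{uv}; p_{uv})}
  = \frac{\pi_1}{\pi_0} \prod_{(u,v) \in \mathcal{S}}
    \frac{\mathrm{Bern}(a_{uv}; p_{uv|1})}{\mathrm{Bern}(a_{uv}; p_{uv|0})}
```

The product over $\mathcal{E} \setminus \mathcal{S}$ is identical in the numerator and denominator, so only the signal-edges and the priors affect which class is preferred.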

2.3 Classifier 

A graph classifier, $h \in \mathcal{H}$, is any function satisfying $h : 
\mathcal{G} \to \mathcal{Y}$. We desire the "best" possible classifier, $h^*$. To 
define best, we first choose a loss function, $\ell_h : \mathcal{G} \times \mathcal{Y} \to 
\mathbb{R}_+$, specifically the 0-1 loss function: 

$\ell_h(G, y) = \mathbb{I}\{h(G) \neq y\}$, (3) 

where $\mathbb{I}\{\cdot\}$ is the indicator function, equaling one whenever 
its argument is true, and zero otherwise. Further, 
let risk, $R : \mathcal{F} \times \mathcal{H} \to \mathbb{R}_+$, be the expected loss under the 
true distribution: 

$R(F, h) \triangleq \mathbb{E}_F[\ell_h(G, Y)]$. (4) 

The Bayes optimal (best) classifier for a given distribution 
$F$ minimizes risk. It can be shown that the classifier 
that maximizes the class-conditional posterior $F_{Y|G}$ is 
optimal [13]: 

$h^* = \operatorname{argmin}_{h \in \mathcal{H}} \mathbb{E}_F[\ell_h(G, Y)] = \operatorname{argmax}_{y \in \mathcal{Y}} F_{G|Y=y} F_{Y=y}$. (5) 



Given the proposed model, Eq. (5) can be further factorized 
using the above four assumptions: 

$h^*(G) = \operatorname{argmax}_{y \in \mathcal{Y}} \prod_{(u,v) \in \mathcal{S}} \mathrm{Bern}(A_{uv}; p_{uv|y}) \, \pi_y$. (6) 

Unfortunately, Bayes optimal classifiers are typically unavailable. 
In such settings, it is therefore desirable to 
induce a classifier estimate from a set of training data. 
Formally, let $\mathcal{T}_n = \{(G_i, Y_i)\}_{i \in [n]}$ denote the training 
corpus, where each graph-class pair is sampled exchangeably 
from the true but unknown distribution: 
$(G_i, Y_i) \overset{exch.}{\sim} F_{G,Y}$. Given such a training corpus and an 
unclassified graph $G$, an induced classifier predicts the 
true (but unknown) class of $G$, $\hat h : \mathcal{G} \times (\mathcal{G} \times \mathcal{Y})^n \to \mathcal{Y}$. 
When a model $\mathcal{F}_{G,Y}$ is specified, a beloved approach 
is to use a Bayes plugin classifier. Due to the above 
simplifying assumptions, the Bayes plugin classifier for 
this model is defined as follows. First, estimate the model 
parameters $\theta = \{\mathcal{S}, p, \pi\}$. Second, plug those estimates 
into the above equation. The result is a Bayes plugin 
graph classifier: 



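$\hat h(G; \mathcal{T}_n) \triangleq \operatorname{argmax}_{y \in \mathcal{Y}} \prod_{(u,v) \in \hat{\mathcal{S}}} \hat p_{uv|y}^{a_{uv}} (1 - \hat p_{uv|y})^{1 - a_{uv}} \, \hat\pi_y$, (7) 

where the Bernoulli probability is explicit. To implement 
such a classifier estimate, we specify estimators for $\mathcal{S}$, $\pi$ 
and $p$. 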
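
A minimal Python sketch of the plug-in rule in Eq. (7) may make the decision concrete. It assumes graphs are stored as dictionaries mapping an edge `(u, v)` to 0 or 1, and that the estimates are already available; the function name `plugin_classify` and the toy numbers are illustrative, not taken from the authors' MATLAB code.

```python
import math

def plugin_classify(a, signal_edges, p_hat, pi_hat):
    """Return the class maximizing the log of the Eq. (7) objective.

    a            : dict mapping edge (u, v) -> 0 or 1 for the new graph
    signal_edges : iterable of edges in the estimated signal-subgraph
    p_hat        : dict mapping class y -> {edge: estimated edge probability}
    pi_hat       : dict mapping class y -> estimated prior (must be > 0)
    """
    best_class, best_score = None, -math.inf
    for y, prior in pi_hat.items():
        # Work in log space so products over many edges do not underflow.
        score = math.log(prior)
        for e in signal_edges:
            p = p_hat[y][e]  # assumed strictly inside (0, 1)
            score += math.log(p) if a[e] == 1 else math.log(1.0 - p)
        if score > best_score:
            best_class, best_score = y, score
    return best_class

# Toy example: two signal-edges that are denser in class 1 than in class 0.
p_hat = {0: {(0, 1): 0.1, (0, 2): 0.1}, 1: {(0, 1): 0.3, (0, 2): 0.3}}
pi_hat = {0: 0.5, 1: 0.5}
print(plugin_classify({(0, 1): 1, (0, 2): 1}, [(0, 1), (0, 2)], p_hat, pi_hat))
```

Only the signal-edges enter the score, which reflects the cancellation of the nuisance parameters; the log transform does not change the maximizer.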

2.4 Estimators 

2.4.1 Desiderata 

We desire a sequence of estimators, $\hat\theta_1, \hat\theta_2, \ldots$, that satisfy 
the following desiderata: 

• Consistent: an estimator is consistent (in some specified 
sense) if its sequence converges in the limit to 
the true value: $\lim_{n \to \infty} \hat\theta_n = \theta$. 

• Robust: an estimator is robust if the resulting 
estimate is relatively insensitive to small model 
misspecifications. Because the space of models is 
massive (uncountably infinite), it is intractable to 
consider all misspecifications, so we consider only 
a few of them, as described below. 

• Quadratic complexity: computational time complexity 
should be no more than quadratic in the 
number of vertices. 

• Interpretable: we desire that the parameters are 
interpretable with respect to a subset of vertices 
and/or edges. 

In addition to the above theoretical desiderata, we also 
desire appealing finite sample and empirical performance. 

2.4.2 Signal-Subgraph Estimators 

Naively, one might consider a search over all possible 
signal-subgraphs by plugging each one into the classifier 
and selecting the best performing option. This strategy 
is intractable because the number of signal-subgraphs 
scales super-exponentially with the number of vertices 
(see Figure 1, left panel). Specifically, the number of 
possible edges in a simple graph with $V$ vertices is 
$d_V = \binom{V}{2}$, so the number of unique possible signal-subgraphs 
is $2^{\binom{V}{2}}$. Searching over all of them is sufficiently 
computationally taxing as to motivate the search 
for other alternatives. 

Before proceeding, recall that each edge is independent; 
thus, one can evaluate each edge separately (although 
treating edges independently is not necessarily 
advisable; consider the Stein estimator [14]). Formally, 
consider a hypothesis test for each edge. The simple null 
hypothesis is that the class-conditional edge distributions 
are the same, so $H_0 : F_{uv|0} = F_{uv|1}$. The composite 
alternative hypothesis is that they differ, $H_A : F_{uv|0} \neq 
F_{uv|1}$. Given such hypothesis tests, one can construct test 
statistics $T^{(n)}_{uv} : \mathcal{T}_n \to \mathbb{R}_+$. We reject the null in favor of 
the alternative whenever the value of the test statistic 
is greater than some critical-value: $T^{(n)}_{uv}(\mathcal{T}_n) > c$. We 
can therefore construct a significance matrix $T = (T^{(n)}_{uv})$, 
which is the sufficient statistic for the signal-subgraph 
estimators. Example test statistics include Fisher's and 
chi-squared, which will be discussed further below. 
Whichever test statistic one uses, the sufficient statistics 
are captured in a $2 \times |\mathcal{Y}|$ contingency table, indicating the 
number of times edge $(u, v)$ was observed in each class. 
For example, the two-class contingency table for each 
edge is given by: 

             Class 0              Class 1              Total 
Edge         $n_{uv|0}$           $n_{uv|1}$           $n_{uv}$ 
No Edge      $n_0 - n_{uv|0}$     $n_1 - n_{uv|1}$     $n - n_{uv}$ 
Total        $n_0$                $n_1$                $n$ 

For simplicity, we will assume that $|\mathcal{Y}| = 2$ for the 
remainder, though the general case is relatively straightforward. 
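
As an illustration of the per-edge test, the sketch below builds the two-class contingency counts for one edge and computes a two-sided Fisher exact p-value directly from the hypergeometric distribution, using only the Python standard library. The helper names `contingency` and `fisher_p`, the dictionary graph representation, and the choice of the two-sided "no more probable than observed" rule are assumptions made for this example.

```python
from math import comb

def contingency(graphs, labels, edge):
    """Counts (n_uv|0, n_uv|1, n_0, n_1) for one edge from 0/1-labeled graphs."""
    n = [0, 0]      # number of training graphs per class
    n_uv = [0, 0]   # number of those graphs that contain the edge
    for a, y in zip(graphs, labels):
        n[y] += 1
        n_uv[y] += a[edge]
    return n_uv[0], n_uv[1], n[0], n[1]

def fisher_p(n_uv0, n_uv1, n0, n1):
    """Two-sided Fisher exact p-value for the 2 x 2 edge-by-class table."""
    k = n_uv0 + n_uv1                 # total number of graphs with the edge
    total = comb(n0 + n1, k)

    def prob(x):                      # hypergeometric P(n_uv|1 = x | margins)
        return comb(n1, x) * comb(n0, k - x) / total

    lo, hi = max(0, k - n0), min(k, n1)
    p_obs = prob(n_uv1)
    # Sum over all tables that are no more probable than the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

# Edge (0, 1) is absent in all three class-0 graphs and present in all class-1 graphs.
graphs = [{(0, 1): 0}] * 3 + [{(0, 1): 1}] * 3
labels = [0, 0, 0, 1, 1, 1]
table = contingency(graphs, labels, (0, 1))
print(table, fisher_p(*table))
```

With these p-values, smaller means more significant, which is the convention used by Eq. (8).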

[Figure 1: three panels titled "unconstrained", "incoherent", and "coherent"; abscissa: number of vertices (ticks at 50 and 100); ordinate: log scale.] 

Fig. 1. Exhaustive searches for the signal-subgraph, even given severe constraints, are computationally intractable 
even for small graphs. The three panels illustrate the number of unique simple subgraphs as a function of the 
number of vertices $V$ for the three different constraint types considered: unconstrained, edge constrained, and both 
edge and vertex constrained (coherent). Note the ordinates are all log scale. On the left is the unconstrained scenario, 
that is, all possible subgraphs for a given number of vertices. In the middle panel, each line shows the number of 
subgraphs with fixed number of signal-edges, $s$, ranging from 1 to 100, incrementing by 1 with each line. The right 
panel shows the number of subgraphs for various fixed $s$ and only a single signal-vertex; that is, all edges are incident 
to one vertex. 

2.4.2.1 Incoherent Signal-Subgraph Estimators: Assume 
the size of the signal-subgraph, $|\mathcal{S}| = s$, is known. 
The number of subgraphs with $s$ edges on $V$ vertices 
is given by $\binom{d_V}{s}$; also super-exponential (see Figure 1, 
middle panel). Thus, searching them all is currently 
computationally intractable. When $s$ is given, under 
the independent edge assumption, one can choose the 
critical value a posteriori to ensure that only $s$ edges are 
rejected under the null (that is, have significant class-conditional 
differences): 

minimize $c$ 
subject to $\sum_{(u,v) \in \mathcal{E}} \mathbb{I}\{T^{(n)}_{uv} < c\} \ge s$. (8) 

Therefore, an estimate of the signal-subgraph is the 
collection of $s$ edges with minimal test statistics. Let 
$T_{(1)} \le T_{(2)} \le \cdots \le T_{(d_V)}$ indicate the ordered test statistics 
(dropping the superscript indicating the number of 
samples for brevity). Then, the incoherent signal-subgraph 
estimator is given by $\hat{\mathcal{S}}_n(s) = \{e_{(1)}, \ldots, e_{(s)}\}$, where $e_{(u)}$ 
indicates the $u$th edge ordered by significance of its test 
statistic, $T_{(u)}$. 

Note that the number of distinct test-statistic values 
is typically much smaller than the number of possible 
settings of $s$; specifically, the number of unique test 
statistic values will be $t \le \min(|\mathcal{E}|, (n_0 + 1)(n_1 + 1))$. 
In practice, $t$ is often far less than either of the 
upper bounds, because not every edge has a unique 
contingency table. In such scenarios, certain settings of 
the hyper-parameters will lead to "ties", that is, edges 
that are equally valid under the assumptions. In such 
settings, we simply randomly choose edges satisfying 
the criterion. 

Pseudocode for implementing the incoherent signal-subgraph 
estimator is provided in Algorithm 1, and 
MATLAB code is available from http://jovo.me. 
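
The incoherent estimator described above can be sketched in a few lines of Python. This is an illustrative stand-in, not the authors' MATLAB implementation: it assumes the significance matrix is stored as a dictionary `T` mapping each edge to a p-value (smaller is more significant, as in Eq. (8)), and the name `incoherent_estimate` is hypothetical.

```python
import random

def incoherent_estimate(T, s, seed=0):
    """Return the s edges with the smallest statistics in T (edge -> value)."""
    rng = random.Random(seed)
    edges = list(T)
    rng.shuffle(edges)               # random order, so ties are broken randomly
    edges.sort(key=lambda e: T[e])   # stable sort keeps the shuffled tie order
    return set(edges[:s])

T = {(0, 1): 0.01, (0, 2): 0.50, (1, 2): 0.02}
print(incoherent_estimate(T, 2))
```

Shuffling before a stable sort is one simple way to realize the random tie-breaking mentioned in the text.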

Algorithm 1 Pseudocode for estimating incoherent 
signal-subgraph. 

Input: $\mathcal{T}_n$ and $s$ 
Output: $\hat{\mathcal{S}}_n(s)$ 
1: Compute test statistics $T^{(n)}_{uv}$ for all $(u, v) \in \mathcal{E}$ 
2: Sort each edge according to its test-statistic rank, 
   $T_{(1)} \le T_{(2)} \le \cdots \le T_{(d_V)}$ 
3: Let $\hat{\mathcal{S}}_n(s) = \{e_{(1)}, \ldots, e_{(s)}\}$, arbitrarily breaking ties 
   as necessary. 

2.4.2.2 Coherent Signal-Subgraph Estimators: In addition 
to the size of the signal-subgraph, also assume 
that each of the edges in the signal-subgraph is incident 
to one of $m$ special vertices called signal-vertices. While 
this assumption further constrains the candidate sets 
of edges, the number of feasible sets still scales super-exponentially 
(see Figure 1, right panel). Therefore, we 
again take a greedy approach. 

First, compute the significance of each edge, as above, 
yielding ordered test statistics. Second, rank edges by 
significance with respect to each vertex, $e_{k,(1)} \le e_{k,(2)} \le 
\cdots \le e_{k,(V-1)}$ for all $k \in \mathcal{V}$. Third, initialize the critical 
value at zero, $c = 0$. Fourth, assign each vertex a 
score equal to the number of edges incident to that 
vertex more significant than the critical value, $w_{v;c} = 
\sum_{u \in [V]} \mathbb{I}\{T_{v,u} > c\}$. Fifth, sort the vertex significance 
scores, $w_{(1);c} \ge w_{(2);c} \ge \cdots \ge w_{(V);c}$. Sixth, check if there 
exist $m$ vertices whose scores sum to greater than or 
equal to the size of the signal-subgraph, $s$. That is, check 
whether the following optimization problem is satisfied: 

minimize $c$ 
subject to $\sum_{v \in [m]} w_{(v);c} \ge s$. (9) 

If so, call the collection of $s$ most significant edges from 
within that subset the coherent signal-subgraph estimate, 
$\hat{\mathcal{S}}_n(s, m)$. If not, increase $c$ and go back to step four. 
As above, we break ties arbitrarily. Pseudocode for 
implementing the coherent signal-subgraph estimator is 
provided in Algorithm 2, and MATLAB code is available 
from http://jovo.me. 
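
The six steps above can be sketched as follows. This is a hedged illustration under several assumptions that are not in the original: `T` maps each edge to a p-value, so "more significant than $c$" is implemented as `t <= c`; the critical value is raised through the observed statistic values rather than by a fixed increment; and an extra guard checks that at least $s$ distinct edges are available, because an edge joining two selected vertices is counted twice in the score sum. The name `coherent_estimate` is hypothetical.

```python
def coherent_estimate(T, s, m):
    """Greedy coherent signal-subgraph estimate from T (edge -> p-value).

    Returns (set of s edges, set of selected signal-vertices).
    """
    vertices = sorted({v for e in T for v in e})
    for c in sorted(set(T.values())):      # raise c one observed value at a time
        # Score each vertex by its number of incident edges with statistic <= c.
        w = {v: 0 for v in vertices}
        for (u, v), t in T.items():
            if t <= c:
                w[u] += 1
                w[v] += 1
        top = sorted(vertices, key=lambda v: -w[v])[:m]   # m highest scores
        if sum(w[v] for v in top) < s:
            continue                        # the Eq. (9) constraint is not yet met
        chosen = set(top)
        cand = [e for e, t in T.items()
                if t <= c and (e[0] in chosen or e[1] in chosen)]
        if len(cand) >= s:                  # guard against double-counted edges
            cand.sort(key=lambda e: T[e])
            return set(cand[:s]), chosen
    raise ValueError("no feasible estimate: too few edges touch any m vertices")

# Four vertices; the three smallest p-values form a triangle, not a star.
T = {(0, 1): 0.01, (0, 2): 0.02, (0, 3): 0.30,
     (1, 2): 0.03, (1, 3): 0.50, (2, 3): 0.60}
print(coherent_estimate(T, s=3, m=1))
```

In this toy case the incoherent rule would pick the triangle on vertices 0, 1, 2, whereas the single-vertex constraint forces the three edges incident to vertex 0.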
Fig. 2. An example of the coherent signal-subgraph estimate's improved accuracy over the incoherent signal-subgraph 
estimate, for a particular homogeneous two-class model specified by $\mathcal{M}_{70}(1, 20; 0.5, 0.1, 0.3)$. Each row 
shows the same columns but for an increasing number of graph/class samples. The columns show the: (far left) 
negative log-significance matrix, computed using Fisher's exact test (lighter means more significant; each panel is 
scaled independently of the others because only relative significance matters here); (middle left) incoherent estimate of 
the signal-subgraph; (middle right) coherent estimate of the signal-subgraph; (far right) coherogram. As the number 
of training samples increases (lower rows), both the incoherent and coherent estimates converge to the truth (the 
ordinate labels of the middle panels indicate the number of edges correctly identified). For these examples, the 
coherent estimator tends to find more true edges. The coherogram visually depicts the coherency of the signal; it 
is also converging to the truth: the signal-subgraph here contains a single signal-vertex. 



2.4.2.3 Coherograms: In the process of estimating 
the incoherent signal-subgraph, one builds a "coherogram". 
Each column of the coherogram corresponds to 
a different critical value $c$, and each row corresponds to a 
different vertex $v$. The $(c, v)$th element of the coherogram, 
$w_{v;c}$, is the number of edges incident to vertex $v$ with 
test statistic larger than $c$. Thus, the coherogram gives a 
visual depiction of the coherence of the signal-subgraph 
(see Figure 2, right column, for some examples). 
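
A coherogram can be tabulated directly from the significance matrix. The sketch below assumes, as a convention chosen for this example, that `T` holds p-values, so an edge counts toward $w_{v;c}$ when its value is at most $c$; with a statistic where larger means more significant, the comparison would be reversed. The function name `coherogram` is hypothetical.

```python
def coherogram(T, critical_values):
    """Rows are vertices, columns are critical values; entries are w_{v;c}."""
    vertices = sorted({v for e in T for v in e})
    rows = []
    for v in vertices:
        incident = [t for e, t in T.items() if v in e]
        rows.append([sum(t <= c for t in incident) for c in critical_values])
    return vertices, rows

T = {(0, 1): 0.01, (0, 2): 0.02, (0, 3): 0.30,
     (1, 2): 0.03, (1, 3): 0.50, (2, 3): 0.60}
vertices, rows = coherogram(T, [0.02, 0.30])
for v, row in zip(vertices, rows):
    print(v, row)
```

A coherent signal shows up as a few rows whose counts grow much faster than the rest as the critical value is relaxed.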



Algorithm 2 Pseudocode for estimating coherent signal-subgraph. 

Input: $\mathcal{T}_n$ and $(s, m)$ 
Output: $\hat{\mathcal{S}}_n(s, m)$ 
1: Compute test statistics $T^{(n)}_{uv}$ for all $(u, v) \in \mathcal{E}$ 
2: Sort each edge according to its vertex-conditional 
   test-statistic rank, $T_{(1),k} \le T_{(2),k} \le \cdots \le T_{(V-1),k}$ for 
   all $k \in \mathcal{V}$ 
3: Let $c = 0$ 
4: Let $w_{v;c} = \sum_{u \in \mathcal{V}} \mathbb{I}\{T_{v,u} > c\}$ for all $v \in \mathcal{V}$ 
5: Let $w_c = \sum_{v \in [m]} w_{(v);c}$ 
6: while $w_c < s$ do 
7:    Let $c \leftarrow c + 1$ 
8:    Update $w_c$ 
9: end while 
10: Let $\hat{\mathcal{S}}_n(s, m)$ be the collection of $s$ edges from 
    amongst those that satisfy Eq. (9) for the final value 
    of $c$, arbitrarily breaking ties as necessary. 

2.4.3 Likelihood Estimators 

The class-conditional likelihood parameters $p_{uv|y}$ are 
relatively simple. In particular, because the graphs are 
assumed to be simple, $p_{uv|y}$ is just a Bernoulli parameter 
for each edge in each class. The maximum likelihood 
estimator (MLE), which is simply the average value of 
each edge per class, is a principled choice: 

$\hat p^{MLE}_{uv|y} = \frac{1}{n_y} \sum_{i | y_i = y} a^{(i)}_{uv}$, (10) 

where $\sum_{i | y_i = y}$ indicates the sum is over all training 
samples from class $y$. Unfortunately, the MLE has an 
undesirable property; in particular, if the data contain 
no examples of an edge in a particular class, then the 
MLE will be zero. If the unclassified graph exhibits that 
edge, then the estimated probability of it being from that 
class is zero, which is undesirable. We therefore consider 
a smoothed estimator: 

$\hat p_{uv|y} = \begin{cases} \eta_n & \text{if } \max_i a^{(i)}_{uv} = 0 \\ 1 - \eta_n & \text{if } \min_i a^{(i)}_{uv} = 1 \\ \hat p^{MLE}_{uv|y} & \text{otherwise} \end{cases}$ (11) 

where we let $\eta_n = 1/(10n)$. 
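
The smoothed estimator can be sketched as below. The code assumes that the maximum and minimum in Eq. (11) are taken over the training graphs of class $y$, that graphs are dictionaries mapping edges to 0 or 1, and that $n$ in $\eta_n$ is the total number of training graphs; the function name `smoothed_edge_probs` is illustrative.

```python
def smoothed_edge_probs(graphs, labels, edges):
    """Estimate p_{uv|y} per Eq. (11) from graphs given as {edge: 0 or 1} dicts."""
    n = len(graphs)
    eta = 1.0 / (10 * n)                      # eta_n = 1/(10n)
    p_hat = {}
    for y in sorted(set(labels)):
        members = [a for a, lab in zip(graphs, labels) if lab == y]
        p_hat[y] = {}
        for e in edges:
            count = sum(a[e] for a in members)
            if count == 0:                    # edge never observed in class y
                p_hat[y][e] = eta
            elif count == len(members):       # edge observed in every class-y graph
                p_hat[y][e] = 1.0 - eta
            else:                             # otherwise use the class-wise MLE
                p_hat[y][e] = count / len(members)
    return p_hat

graphs = [{(0, 1): 0, (0, 2): 1}, {(0, 1): 0, (0, 2): 0},
          {(0, 1): 1, (0, 2): 1}, {(0, 1): 1, (0, 2): 0}]
labels = [0, 0, 1, 1]
p_hat = smoothed_edge_probs(graphs, labels, [(0, 1), (0, 2)])
print(p_hat)
```

Keeping every estimate strictly inside (0, 1) means a single unexpected edge in a test graph cannot drive a class posterior to exactly zero.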



2.4.4 Prior Estimators 

The priors are the simplest. The prior probabilities are 
Bernoulli, and we are concerned only with the case 
where $|\mathcal{Y}| \ll n$, so the maximum likelihood estimators 
suffice: 

$\hat\pi_y = \frac{n_y}{n}$, (12) 

where $n_y = \sum_{i \in [n]} \mathbb{I}\{y_i = y\}$. 



2.4.5 Hyper-Parameter Selection 

The signal-subgraph estimators require specifying the 
number of signal-edges $s$, as well as the number of 
signal-vertices $m$ for the coherent classifier. In both cases, 
the number of possible values is finite. In particular, 
$s \in [d_V]$ and $m \in [V]$. Thus, to select the best hyper-parameters 
we implement cross-validation procedures 
(see Section 2.5.2 for details), iterating over $(s, m) \in \mathbf{s} \times 
\mathbf{m} \subseteq [d_V] \times [V]$. Note that when $m = V$, the coherent signal-subgraph 
estimator reduces to the incoherent signal-subgraph 
estimator. For all simulated data, we compare 
hyper-parameter performance via a training and held-out 
set. For the real data application, we decided to use 
a leave-one-out cross-validation procedure due to the 
small sample size. 

2.4.6 All together 

Putting the above pieces together, Algorithm 3 
provides pseudocode for implementing our signal-subgraph 
classifiers. MATLAB code is available from the 
first author's website, http://jovo.me. 

Algorithm 3 Pseudocode for training signal-subgraph 
classifiers. 

Input: $\mathcal{T}_n$ and a set of constraints $(\mathbf{s}, \mathbf{m})$ 
Output: $\hat{\mathcal{S}}_n$, $\{\hat p_{uv|y}\}_{(u,v) \in \hat{\mathcal{S}}_n}$, $\{\hat\pi_y\}_{y \in \{0,1\}}$ 
1: Partition the data for the appropriate cross-validation 
   procedure 
2: Estimate $\hat p_{uv|y}$ for all $(u, v)$ using Eq. (11) 
3: Estimate $\hat\pi_y$ for all $y$ using Eq. (12) 
4: for all $(s, m) \in (\mathbf{s}, \mathbf{m})$ do 
5:    Compute $\hat{\mathcal{S}}_n(s, m)$ using Algorithm 1 or 2 as 
      appropriate 
6:    Compute cross-validated error $\hat L_{s,m}$ using Eq. (13) 
7: end for 
8: Let $\hat{\mathcal{S}}_n = \operatorname{argmin}_{(s,m)} \hat L_{s,m}$ 

2.5 Finite Sample Evaluation Criteria 

2.5. 1 Likelihoods and priors 

The likelihood and prior estimators will be evaluated 
with respect to robustness to model misspecifications, 
finite samples, efficiency, and complexity. 

2.5.2 Classifier 

We evaluate the classifier's finite sample properties using 
either held-out or leave-one-out misclassification per- 
formance, depending on whether the data is simulated 
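or experimental, respectively. Formally, given $C$ equally 
sized subsets of the data, $\{\mathcal{T}_1, \ldots, \mathcal{T}_C\}$, the cross-validated 
error is given by 

$\hat L_{\hat h(\cdot; \mathcal{T}_n)} = \frac{1}{C} \sum_{c=1}^{C} \frac{1}{|\mathcal{T}_n \setminus \mathcal{T}_c|} \sum_{G \notin \mathcal{T}_c} \mathbb{I}\{\hat h(G; \mathcal{T}_c) \neq Y\}$. (13) 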



Given this definition, let be the error of the classifier 
using only the prior estimates, and let be the error 
for the Bayes optimal classifier. 

To determine whether a classifier is significantly better 
than chance, we randomly permute the classes of each 
graph $n_{MC}$ times, and then estimate a naive Bayes 
classifier using the permuted data, yielding an empirical 
distribution. The p-value of a permutation test is the 
minimum fraction of Monte Carlo permutations that did 
better than the classifier of interest [15]. 
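
The permutation procedure can be sketched generically. In the code below, `error_fn` is a hypothetical callable that retrains the classifier on the given labels and returns its cross-validated error; the function name `permutation_p_value` is also illustrative. Counting permutations that do at least as well (rather than strictly better) is a conservative choice made for this sketch, not something the text specifies.

```python
import random

def permutation_p_value(error_fn, graphs, labels, n_mc=1000, seed=0):
    """Fraction of label permutations whose error is no worse than the observed error."""
    rng = random.Random(seed)
    observed = error_fn(graphs, labels)
    at_least_as_good = 0
    for _ in range(n_mc):
        permuted = list(labels)
        rng.shuffle(permuted)                 # break the graph/class pairing
        if error_fn(graphs, permuted) <= observed:
            at_least_as_good += 1
    return at_least_as_good / n_mc
```

A small returned fraction indicates that the observed error is unlikely to arise when the class labels carry no information about the graphs.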

To determine whether a pair of classifiers is significantly 
different, we compare the leave-one-out classification 
results using McNemar's test [16]. 
2.5.3 Signal-Subgraph Estimators 

To evaluate absolute performance of the signal-subgraph 
estimators, we define "miss-edge rate" as the fraction of 
true edges missed by the signal-subgraph estimator: 

$R_n = \frac{1}{|\mathcal{S}|} \sum_{(u,v) \in \mathcal{S}} \mathbb{I}\{(u,v) \notin \hat{\mathcal{S}}_n\}$. (14) 

Note that when $|\hat{\mathcal{S}}|$ is fixed, miss-edge rate is a sufficient 
statistic for all combinations of false/true positive/negative 
results. Further, we estimate the relative 
rate and relative efficiency to evaluate the relative finite 
sample properties of a pair of consistent estimators. The 
relative rate is simply $(1 - R^{inc}_n)/(1 - R^{coh}_n)$. Relative 
efficiency is the number of samples required for the coherent 
estimator to obtain the same rate as the incoherent 
estimator. 
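
Both quantities are simple to compute once the true and estimated edge sets are known. The sketch below assumes edges are stored as tuples `(u, v)` with `u < v`; the function names and toy edge sets are hypothetical.

```python
def miss_edge_rate(true_edges, estimated_edges):
    """Fraction of true signal-edges absent from the estimate, as in Eq. (14)."""
    true_edges = set(true_edges)
    return len(true_edges - set(estimated_edges)) / len(true_edges)

def relative_rate(rate_inc, rate_coh):
    """Relative rate (1 - R_inc) / (1 - R_coh); undefined when R_coh equals 1."""
    return (1.0 - rate_inc) / (1.0 - rate_coh)

truth = {(0, 1), (0, 2), (0, 3), (0, 4)}
r_inc = miss_edge_rate(truth, {(0, 1), (0, 2), (1, 2), (3, 4)})
r_coh = miss_edge_rate(truth, {(0, 1), (0, 2), (0, 3), (2, 4)})
print(r_inc, r_coh, relative_rate(r_inc, r_coh))
```

A relative rate below one means the coherent estimate recovered a larger share of the true signal-edges than the incoherent estimate.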

3 Estimator Properties 

3.1 Likelihood and Prior Estimators 



Lemma 3.1. $\hat p_{uv|y}$ as defined in Eq. (11) is an L-estimator. 

Proof: Huber defines an L-estimator as an estimator 
that is a linear combination of (possibly nonlinear functions 
of) the order statistics of the measurements [17]. 
Indeed, $\hat p_{uv|y}$ is a thresholded function of the minimum, 
maximum, and mean. □ 

Because L-estimators converge to the MLE, our estimators 
share all the nice asymptotic properties of the MLE. 
Moreover, L-estimators are known to be robust to certain 
model misspecifications [17]. The prior estimators are 
MLEs, and therefore also consistent and efficient. Both 
prior and likelihood estimates are trivial to compute, as 
closed-form analytic solutions are available for both. 

3.2 Signal-Subgraph Estimators 

A variety of test statistics are available for computing the 
edge-specific class-conditional signal, $T^{(n)}_{uv}$. Fisher's exact 
test computes the probability, under the null hypothesis 
that the two classes have the same probability of 
sampling an edge, of obtaining a contingency table equal 
to, or more extreme than, the one observed. In other words, 
Fisher's exact test is the most powerful statistical test 
assuming independent edges [18]. This leads to the 
following lemma: 

Lemma 3.2. $\hat{\mathcal{S}}_n(s', m') \to \mathcal{S}$ as $n \to \infty$ when computing 
$T_{uv}$ via Fisher's exact test, even when $s$ and $m$ are unknown, 
as long as $s' \ge s$ and $m' \ge m$. 

Proof: Whenever $p_{uv|0} \neq p_{uv|1}$, the p-value of 
Fisher's exact test converges to zero; whereas whenever 
$p_{uv|0} = p_{uv|1}$, the distribution of p-values converges 
to the uniform distribution on [0,1]. Therefore, Fisher's 
exact test induces a consistent estimator of the signal-subgraph 
as $n \to \infty$, assuming a fixed and finite $V$. 
Moreover, as $V \to \infty$, as long as $V/n \to 0$, Fisher's exact 
test remains consistent [18]. □ 

While most powerful, computing Fisher's exact test is 
computationally taxing. Fortunately, the chi-squared test 
is asymptotically equivalent to Fisher's test, and therefore 
shares those convergence properties [18]. Even the 
absolute difference of MLEs, $|\hat p^{MLE}_{uv|1} - \hat p^{MLE}_{uv|0}|$, which 
is trivially easy to compute, is asymptotically equivalent 
to Fisher's [18] and therefore consistent. Moreover, 
the signal-subgraph estimators are robust to a 
variety of model misspecifications. Specifically, as long 
as the marginal probabilities of all the edges in the 
signal-subgraph are different between the two classes, 
$p_{uv|1} \neq p_{uv|0}$, and the constraints are upper bounds 
on the true values, $s' \ge s$ and $m' \ge m$, then any 
consistent test statistic will yield a consistent signal-subgraph 
estimator. Estimating the coherent signal-subgraph 
is more computationally time consuming than 
estimating the incoherent signal-subgraph. What is lost 
in computational time, however, is typically gained in 
finite sample efficiency whenever the model does not 
induce too much bias, as will be shown below. 



3.3 Bayes Plugin Classifier 

Lemma 3.3. The Bayes plug-in classifier, using the signal-subgraph, 
likelihood, and prior estimators described above, is 
consistent under the model defined by Eq. (2). 

Proof: A Bayes plugin classifier is a consistent classifier 
whenever the estimates that are plugged in are 
consistent [13]. Because the likelihood, prior, and signal-subgraph 
estimates are all consistent, the Bayes plugin 
classifier is also consistent. □ 

Note that naive Bayes classifiers often exhibit impressive 
finite sample performance due to their winning the 
bias-variance trade-off relative to other classifiers [19]. 
In other words, even when edges are highly dependent, 
because marginal probability estimates are more efficient 
than joint probability estimates, an independent edge 
based classifier will often outperform a classifier based 
on dependencies. 

4 Simulated Experiments 

4.1 Simulation Details 

To better assess the finite sample properties of the signal-subgraph 
estimators, we conduct a number of simulated 
experiments. Consider the following homogeneous model: 
each simple graph has $V = 70$ vertices. Class 0 graphs 
are Erdős-Rényi with probability $p$ for each edge; that is, 
$f_{uv|0} = p \ \forall (u,v) \in \mathcal{E}$. Class 1 graphs are a mixture of two 
Erdős-Rényi models: all edges in the signal-subgraph have 
probability $q$, and all others have probability $p$, so that 
$f_{uv|1} = q \ \forall (u,v) \in \mathcal{S}$, and $f_{uv|1} = p \ \forall (u,v) \in \mathcal{E} \setminus \mathcal{S}$. The 
signal-subgraph is constrained to have $m$ signal-vertices 
and $s$ signal-edges. Let the class-prior probabilities be 
given by $F_{Y=0} = \pi$ and $F_{Y=1} = 1 - \pi$. Thus, the model 
is characterized by $F_G = \mathcal{M}_V(m, s; \pi, p, q)$, where $V$ is a 
constant, $m$ and $s$ are hyper-parameters, and $\pi$, $p$ and $q$ 
are parameters. 
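
Sampling from this homogeneous model is straightforward, and a small sketch may help in reproducing the setup. The function name `sample_homogeneous`, the dictionary graph representation, and the particular choice of signal-edges (vertex 0 joined to vertices 1 through 20) are assumptions for illustration; the text fixes only $m$, $s$, $\pi$, $p$ and $q$.

```python
import random
from itertools import combinations

def sample_homogeneous(V, signal_edges, pi, p, q, n, seed=0):
    """Draw n (graph, class) pairs; graphs are {edge: 0 or 1} over all vertex pairs."""
    rng = random.Random(seed)
    all_edges = list(combinations(range(V), 2))   # simple graph: u < v, no loops
    signal = set(signal_edges)
    data = []
    for _ in range(n):
        y = 0 if rng.random() < pi else 1         # class 0 with probability pi
        graph = {}
        for e in all_edges:
            # Signal-edges have probability q in class 1; everything else has p.
            prob = q if (y == 1 and e in signal) else p
            graph[e] = 1 if rng.random() < prob else 0
        data.append((graph, y))
    return data

# One signal-vertex (m = 1) with s = 20 signal-edges, as in M_70(1, 20; 0.5, 0.1, 0.3).
signal = [(0, v) for v in range(1, 21)]
data = sample_homogeneous(70, signal, pi=0.5, p=0.1, q=0.3, n=8)
print(len(data), len(data[0][0]))
```

Note that this sketch draws the class labels at random, whereas the demonstration in the text conditions on the class-conditional sample size.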

4.2 A Simple Demonstration 

To provide some insight with respect to the finite sample performance
of the incoherent and coherent signal-subgraph estimators for this
model, we run the following simulated experiments, with results
depicted in Figure 2. In each row we sample from
$\mathcal{M}_{70}(1, 20; 0.5, 0.1, 0.3)$ (note that we are actually
conditioning on the class-conditional sample size). Given these n
samples, we compute the
significance matrix (first column), which contains the 
sufficient statistics for both estimators. The incoherent 
estimator simply chooses the s most significant edges as 
the signal-subgraph (second column). The coherent esti- 
mator jointly estimates both the m signal-vertices and the 
s signal-edges incident to at least one of those vertices 
(third column). The coherogram shows the "coherency" 
of the data (fourth column). 
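
The incoherent estimator can be sketched as follows. The per-edge score
used here (the absolute difference between class-conditional edge
frequencies, larger meaning more significant) is an assumption chosen
for brevity; any per-edge two-sample test statistic could be substituted
for it. The function names are hypothetical, and both classes are
assumed to appear in the training data.

```python
def edge_scores(graphs, labels):
    """Score each edge by |frequency in class 1 - frequency in class 0|."""
    n1 = sum(labels)
    n0 = len(labels) - n1
    scores = {}
    for e in graphs[0]:
        c1 = sum(g[e] for g, y in zip(graphs, labels) if y == 1)
        c0 = sum(g[e] for g, y in zip(graphs, labels) if y == 0)
        scores[e] = abs(c1 / n1 - c0 / n0)
    return scores


def incoherent_estimate(scores, s):
    """Return the s highest-scoring edges as the signal-subgraph estimate."""
    return set(sorted(scores, key=scores.get, reverse=True)[:s])


# Toy data on three vertices: only edge (0, 1) differs between classes.
graphs = [
    {(0, 1): 1, (0, 2): 0, (1, 2): 1},
    {(0, 1): 1, (0, 2): 1, (1, 2): 0},
    {(0, 1): 0, (0, 2): 0, (1, 2): 1},
    {(0, 1): 0, (0, 2): 1, (1, 2): 0},
]
labels = [1, 1, 0, 0]
scores = edge_scores(graphs, labels)
print(incoherent_estimate(scores, 1))
```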

From this figure, one might notice a few tendencies. 
First, both the incoherent and coherent signal-subgraph estimates
seem to converge to the true signal-subgraph. Second,
while both estimators perform poorly with n < 16, the 
coherent estimator converges more quickly than the in- 
coherent estimator. Third, the coherogram sharpens with 
additional samples, showing after only approximately 50 
samples that this model is strongly coherent. 
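
One way to realize the joint selection of m signal-vertices and s
incident signal-edges is to relax a significance threshold until the m
vertices with the most retained edges carry at least s of them. The
sketch below is an assumed reading of that idea, not the authors'
implementation; it takes per-edge scores where larger means more
significant, and the function name and toy scores are hypothetical.

```python
from collections import Counter


def coherent_estimate(scores, m, s):
    """Relax a score threshold until the m busiest vertices carry >= s edges."""
    for t in sorted(set(scores.values()), reverse=True):
        kept = [e for e, val in scores.items() if val >= t]
        degree = Counter()
        for u, v in kept:
            degree[u] += 1
            degree[v] += 1
        hubs = {v for v, _ in degree.most_common(m)}
        incident = [e for e in kept if e[0] in hubs or e[1] in hubs]
        if len(incident) >= s:
            incident.sort(key=scores.get, reverse=True)
            return hubs, set(incident[:s])
    raise ValueError("no threshold yields s edges incident to m vertices")


# Toy scores: vertex 0 carries three moderately significant edges,
# while the single most significant edge (1, 2) lies elsewhere.
toy_scores = {(0, 1): 0.9, (0, 2): 0.8, (0, 3): 0.7,
              (1, 2): 0.95, (2, 3): 0.1, (1, 3): 0.05}
hubs, edges = coherent_estimate(toy_scores, m=1, s=3)
print(hubs, edges)
```

In this toy case simply taking the three highest scores would include
edge (1, 2), whereas the coherent selection keeps only edges incident to
vertex 0.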

4.3 Quantitative Comparisons 

To better characterize the relative performance of the two
signal-subgraph estimators, Figure 3 shows their performance as a
function of the number of training samples, n, for the
$\mathcal{M}_{70}(1, 20; 0.5, 0.1, 0.3)$ model. The top panel shows the
mean and standard error of the missed-edge rate (the fraction of edges
incorrectly identified), averaged over 200 trials. For essentially all
n, the coherent estimator (black solid line) performs better than the
incoherent estimator (gray solid line). We also compare the performance
of an $\ell_1$-penalized logistic regression classifier ('lasso'
hereafter [20]). As expected, the missed-edge rate for the lasso (gray
dashed line) and the incoherent classifier are about the same. The
improvement in signal-edge detection of the coherent signal-subgraph
estimator over that of the incoherent estimator and the lasso
translates directly to improved classification performance (middle
panel), where the plugin classifier using the coherent signal-subgraph
estimator has a better misclassification rate than either the
incoherent signal-subgraph classifier or the lasso for essentially all
n. Note that the incoherent classifier also admits better performance
than the lasso. This is expected: although the two are very similar,
the incoherent classifier was derived specifically for this joint
graph/class model. For comparison purposes, the naive Bayes plugin
classifier, that is, the classifier that assumes the whole graph is the
signal-subgraph, is also shown (black dashed line). Note that the
performance of all the classifiers is bounded above by chance (0.5) and
below by the Bayes error, $L^* = 0.13$. Moreover,
$\hat{L}_{nb} > \hat{L}_{lasso} > \hat{L}_{inc} > \hat{L}_{coh}$ for
essentially all n.
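
For concreteness, the missed-edge rate can be computed as below. This
sketch assumes the rate is the fraction of true signal-edges absent
from the estimate, which coincides with the fraction of estimated edges
that are wrong whenever the estimate has the same size as the true
signal-subgraph. The function name is hypothetical.

```python
def missed_edge_rate(true_signal, estimated_signal):
    """Fraction of true signal-edges that the estimate failed to recover."""
    true_signal = set(true_signal)
    return len(true_signal - set(estimated_signal)) / len(true_signal)


# One of the two true signal-edges is recovered.
rate = missed_edge_rate({(0, 1), (0, 2)}, {(0, 1), (1, 2)})
print(rate)
```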

An important aspect of any algorithm is compute time, both of training
and testing. The signal-subgraph classifiers that we developed are very
fast. Computations essentially amount to computing a test-statistic for
all $|\mathcal{E}|$ edges, then sorting them. The parameter estimates
of the likelihood and prior terms come directly from the same
test-statistics used to obtain the significance of each edge. Thus,
obtaining those estimates amounts to essentially computing a mean. On
the other hand, the lasso classifier, which yields worse signal
detection and misclassification rates than both our classifiers,
requires an iterative algorithm for each value on the hyper-parameter
path [20]. Although efficient computational schemes have been developed
for searching the whole regularization path [21], such iterative
algorithms should be much slower than our classifiers.

Indeed, the lower panel of Figure 3 demonstrates that our MATLAB
implementation of the signal-subgraph classifiers is approximately 10
times faster than MATLAB's lasso implementation. All the results shown
in Figure 3 include error bars computed from 100 trials, each with 100
held-out samples, demonstrating that for these simulation parameters,
the differences are highly significant. Although the quantitative
results may vary for different implementations and different parameter
settings, our expectation is that the qualitative results should be
consistent. Because our classifiers have lower risk, better signal
identification, and run an order of magnitude faster than the standard,
we do not consider lasso in further simulations.

The above numerical results suggest that the coherent estimator almost
always achieves better signal-subgraph identification and
classification performance than the incoherent estimator, even though
the computational time of the coherent classifier is almost identical
to that of the incoherent classifier. However, that result is a
function of both the model $\mathcal{M}_V$ (which includes the number
of vertices) and the number of training samples n (there is a
bias-variance trade-off here, as always). Figure 4 explicitly shows
that the relative performance of an estimator for a particular model,
$\mathcal{M}_{30}(1, 5; 0.5, 0.1, 0.2)$, changes as a function of the
number of samples. More specifically, for small n, the incoherent
estimator yields better performance, as indicated by the relative rate
and relative efficiency being above one. However, with more samples,
when the signal-subgraph is coherent, the coherent estimator will
eventually outperform the incoherent one. At infinite samples, since
both estimators are consistent, they will yield identical results: the
truth.

Thus, to choose which estimator will likely achieve the best
performance, knowledge of the model,
$\mathcal{M}_V(m, s; \pi, p, q)$, is insufficient; rather, both the
model and the number of samples must be known a priori.

Fig. 3. Performance statistics as a function of sample size demonstrate
that the coherent signal-subgraph estimator outperforms the incoherent
signal-subgraph estimator, in terms of both signal-subgraph
identification and classification, for nearly all n, using the same
model as in Figure 2, $\mathcal{M}_{70}(1, 20; 0.5, 0.1, 0.3)$.
Moreover, even the incoherent classifier outperforms the
$\ell_1$-penalized logistic regression (lasso) on all our metrics. The
top panel shows the missed-edge rate for each estimator as a function
of the number of training samples, n. The middle panel shows the
corresponding misclassification rate for the estimators, as well as the
naive Bayes plugin classifier. Performance of all estimators improves
(nearly) monotonically with n for both criteria. The bottom panel shows
total training and testing time for each classifier. The lasso is about
10 times slower than the others. Error bars show standard error of the
mean here and elsewhere unless otherwise noted (averaged over 100
trials; each trial used 100 samples for held-out data). Error bars on
the lower panel show the inter-quartile range. Note that for most
values of n, we have chance
$> \hat{L}_{nb} > \hat{L}_{lasso} > \hat{L}_{inc} > \hat{L}_{coh} > L^*$.
Legend: "inc": incoherent; "coh": coherent; "nb": naive Bayes;
"lasso": lasso.

4.4 Estimating the Hyper-Parameters 

In the above analyses the hyper-parameters, both the number of
signal-edges s and signal-vertices m, were known. In practice, while
one might have a preliminary guess of the range of these
hyper-parameters, the optimal values will usually be unknown. We can
therefore use a cross-validation technique to search over the space of
all reasonable combinations of s and m, and choose the best performing
combination. Figure 5 shows one such simulation depicting several key
features. The top panel shows the misclassification rate on held-out
data as a function of the log of the assumed size of the
signal-subgraph for the incoherent classifier. Although the true size
is s = 20, the best performing estimate is $\hat{s}_{inc} = 23$. This
is a relatively standard result in model selection: the best performer
will include a few extra dimensions because adding a few uninformative
features is less costly than missing a few informative features [22].
This intuition is further supported by the U-shape of the
misclassification curve on a log scale: including many
non-signal-edges is less detrimental than excluding a few signal-edges.

Fig. 4. The relative performance of the coherent and incoherent
estimators is a function not just of the model, but also the number of
training samples. Specifically, for the same model,
$\mathcal{M}_{70}(1, 20; 0.5, 0.1, 0.3)$, we compute the missed-edge
rates for both the incoherent estimator (gray line) and the coherent
estimator (black line), averaged over 200 trials. The top panel shows
that for small training sample size the incoherent estimator achieves a
better (lower) missed-edge rate than the coherent estimator. However,
the incoherent estimator's convergence rate is slower, and the coherent
estimator catches up and outperforms the incoherent estimator until
both eventually converge at the truth. The middle and bottom panels
show the relative rate and efficiency curves for this model. Note that
the curves dip below unity, and then converge to unity, as they must,
because both estimators are consistent.

The bottom panel shows the coherent performance by varying both m and
s, which exhibits a "banded" structure, indicating that the performance
is relatively robust to small changes in m. This banding likely results
from the fact that the test statistics are identical for many edges, so
minor changes in the number of allowable edges are not expected to
change performance much. The best performing pair achieved
$\hat{L}_{coh} = 0.13$ (which is equal to the Bayes error) with
$\hat{m}_{coh} = 1$ and $\hat{s}_{coh} = 24$, suggesting that n was
sufficiently large to correctly find the true signal-vertex, and
further corroborating the "better safe than sorry" attitude to
selecting the signal-edges.



[Figure 5 panels. Top: incoherent estimator (x-axis: log assumed # of
signal edges). Bottom: coherent estimator (x-axis: assumed # of signal
edges).]



Fig. 5. When constraints on the number of signal-edges (s) or
signal-vertices (m) are unknown, a search over these hyperparameters
can yield estimates $\hat{s}$ and $\hat{m}$. Both panels depict
held-out cross-validation error as a function of varying these
parameters for the model $\mathcal{M}_{70}(1, 20; 0.5, 0.1, 0.3)$ (the
same as in Figures 2 and 3), using 200 training samples and 500 test
samples, with m = 1 and s = 20. The top panel depicts the
misclassification rate of the incoherent estimator as a function of the
number of estimated signal-edges on a log scale, with the best
performing classifier achieving $\hat{L}_{inc} = 0.21$. Note that in
this simulation, $s = 20 < \hat{s}_{inc} = 23$. This "conservatism" is
typical and appropriate in many model selection situations. The bottom
panel shows $\hat{L}_{coh}$ as a function of both $m'$ and $s'$. For
this simulation, $\hat{m}_{coh} = 1$ and $\hat{s}_{coh} = 24$, further
corroborating the conservative stance on model selection. Note that
chance $> \hat{L}_{nb} > \hat{L}_{inc} > \hat{L}_{coh} \geq L^*$, as
one would hope for this coherent simulation. Incidentally, the coherent
classifier achieved Bayes error here, $L^* = 0.13$.

5 MR Connectome Sex Classification

A connectome is a brain-graph [23]. MR connectomes utilize multi-modal
Magnetic Resonance (MR) imaging to determine both the vertex and edge
set for each individual. This section investigates the utility of the
classifiers developed above on data collected for the Baltimore
Longitudinal Study of Aging, as described previously [24]. Briefly, 49
subjects (25 male, 24 female) underwent a diffusion-weighted MRI
protocol. The Magnetic Resonance Connectome Automated Pipeline (MRCAP)
was used to convert each subject's raw multi-modal MR data into a
connectome [25] (each connectome is a simple graph with 70 vertices and
up to $\binom{70}{2} = 2415$ edges). Lacking strong priors on either
the number of signal-edges or signal-vertices in the signal-subgraph
(or even whether a signal-subgraph exists), we searched over a large
space of hyper-parameters using leave-one-out cross-validated
misclassification performance as our metric of success (Figure 6). The
naive Bayes classifier, which assumes the signal-subgraph is the whole
edge set, $\mathcal{S}_{nb} = \mathcal{E}$, performs marginally better
than chance: $\hat{L}_{nb} = 0.41$ (p-value $\approx 0.05$ assessed by
a permutation test). With a relatively small number of incoherent
edges, $\hat{s}_{inc} = 10$, the incoherent classifier (top left panel)
achieves $\hat{L}_{inc} = 0.27$, significantly better than chance
(p-value < 0.0007), but not significantly better than the naive Bayes
classifier (using McNemar's test). The coherent classifier achieved a
minimum of $\hat{L}_{coh} = 0.16$ (top right and middle panels),
significantly better than both chance and the naive Bayes classifier
(p-values $< 10^{-5}$ and < 0.004, respectively). This improved
performance upon using the coherent classifier suggests that the
signal-subgraph is at least approximately coherent. Using
$\hat{m}_{coh} = 12$ and $\hat{s}_{coh} = 360$ from the best performing
coherent classifier, we can estimate the signal-subgraph (bottom left).
The coherogram suggests that indeed, the signal is somewhat, but not
entirely, coherent (bottom right).
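
The hyper-parameter search by leave-one-out cross-validation described
above can be sketched generically. The helper names (`fit_predict`,
`majority_vote`, `search_grid`) are hypothetical, and the
majority-vote rule is only a placeholder standing in for a
signal-subgraph classifier trained with a given (m, s).

```python
def leave_one_out_error(samples, labels, fit_predict):
    """Average 0/1 loss when each sample is held out exactly once.

    fit_predict(train_x, train_y, test_x) must return a predicted label.
    """
    mistakes = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        mistakes += fit_predict(train_x, train_y, samples[i]) != labels[i]
    return mistakes / len(samples)


def search_grid(grid, error_of):
    """Return the hyper-parameter value with the lowest error, and that error."""
    errors = {h: error_of(h) for h in grid}
    best = min(errors, key=errors.get)
    return best, errors[best]


def majority_vote(train_x, train_y, test_x):
    """Placeholder classifier: predict the most common training label."""
    return max(set(train_y), key=train_y.count)


loo = leave_one_out_error(list(range(4)), [0, 0, 0, 1], majority_vote)
print(loo)
```

In practice `error_of` would wrap `leave_one_out_error` around a
classifier trained with each candidate pair of hyper-parameters.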

We next compare the performance of our classifiers on this MR
connectome sex classification data set to several other classifiers.
First, a standard parametric classifier: lasso. We chose the
regularization parameter via 10-fold cross-validation. Second, a
non-parametric (distribution-free) classifier: $k_n$-nearest neighbor
(kNN), which operates directly on graphs [26]. This kNN classifier uses
the Frobenius norm distance metric. We tried all $k \in [n]$ and simply
report the best performance. The universal consistency of this kNN
classifier is useful in assessing the algorithm complexity supported by
these data. In particular, given enough samples, kNN will achieve
optimal performance. Less than optimal performance therefore indicates
that the sample size is not sufficiently large for this kNN classifier.
Third, a graph invariant based classifier. We computed six graph
invariants for each graph: size, max degree, scan statistic, number of
triangles, clustering coefficient, and average path length, normalized
each to have zero mean and unit variance, and then used a kNN with
$\ell_2$ distance metric on the invariants. These particular invariants
were chosen based on their desirable statistical properties [27].
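
A kNN rule that operates directly on graphs through the Frobenius norm
can be sketched as follows. Adjacency matrices are represented as nested
lists, ties in the vote are broken by whichever label the counter
returns first, and all names are assumptions made for this example.

```python
import math
from collections import Counter


def frobenius_distance(A, B):
    """Frobenius norm of A - B for two equally sized adjacency matrices."""
    return math.sqrt(sum((a - b) ** 2
                         for row_a, row_b in zip(A, B)
                         for a, b in zip(row_a, row_b)))


def knn_classify(train_graphs, train_labels, query, k):
    """Majority vote among the k training graphs closest to the query."""
    order = sorted(range(len(train_graphs)),
                   key=lambda i: frobenius_distance(train_graphs[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Toy example with two-vertex graphs: class 1 has the edge, class 0 does not.
edge = [[0, 1], [1, 0]]
empty = [[0, 0], [0, 0]]
train = [edge, empty, edge]
train_labels = [1, 0, 1]
print(knn_classify(train, train_labels, empty, k=1))
```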

TABLE 1
Bake-off comparing a number of different classifiers on the MR
connectome sex classification data. Error indicates misclassification
error using the best hyper-parameters found for each classifier.
P-value indicates the p-value of a one-sided McNemar's test comparing
each classifier to the best signal-subgraph classifier. The
signal-subgraph classifier is significantly better than all the others.

classifier         error    p-val
prior              0.50     < 0.01
naive Bayes        0.41     < 0.01
lasso              0.27     < 0.02
graph-kNN          0.35     < 0.02
invariant-kNN      0.43     < 0.01
signal-subgraph    0.16     n/a



Despite the small sample size, Table 1 demonstrates that the
signal-subgraph classifier is significantly better than all the others,
as assessed via a one-sided McNemar's test.
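
A one-sided McNemar's test compares two classifiers through the
subjects on which exactly one of them errs. The sketch below uses the
exact binomial form of the test; whether the original analysis used the
exact or the chi-squared form is not stated, so this choice is an
assumption, as are the function name and the toy error vectors.

```python
from math import comb


def mcnemar_one_sided(errors_a, errors_b):
    """Exact one-sided p-value for H1: classifier A errs less often than B.

    errors_a, errors_b: per-subject 0/1 flags (1 = misclassified).
    """
    only_b = sum(1 for a, b in zip(errors_a, errors_b) if b and not a)
    only_a = sum(1 for a, b in zip(errors_a, errors_b) if a and not b)
    n = only_a + only_b
    if n == 0:
        return 1.0  # no discordant subjects, hence no evidence either way
    # Under H0 the discordant subjects split 50/50; sum the upper tail.
    return sum(comb(n, k) for k in range(only_b, n + 1)) / 2 ** n


# Classifier A is right on four subjects where B is wrong, never the reverse.
p_value = mcnemar_one_sided([0, 0, 0, 0, 0, 1], [1, 1, 1, 1, 0, 1])
print(p_value)
```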

5.1 Model Evaluation 

We investigate to what extent the above estimated signal- 
subgraph represents the true signal-subgraph. We ad- 
dress this question in two ways: (i) synthetic data anal- 
ysis and (ii) assumption checking. 

5.1.1 Synthetic Data Analysis

For the synthetic data analysis, we generated data as follows. Given
the above estimated signal-subgraph, for every edge not in
$\hat{\mathcal{S}}_n$, let $p_{uv|0} = p_{uv|1} = \hat{p}_{uv}$, where
$\hat{p}_{uv}$ is the estimated edge probability averaging over all
samples. For all edges in $\hat{\mathcal{S}}_n$, let
$p_{uv|y} = \hat{p}_{uv|y}$. Set the priors according to the data as
well: $\pi = \hat{\pi}$.

Given this synthetic data model, we first generated 49 data samples, 25
from class 0 and 24 from class 1, and estimated the incoherent and
coherent classifier performance on a single synthetic experiment
(Figure 7, top panels). The performance of the classifiers on the
synthetic data qualitatively mirrors that of the real data, suggesting
some degree of model appropriateness. To assess what fraction of the
edges in the estimated signal-subgraph were reliable, even assuming a
true model, we then sampled up to 100 training samples (and 100 test
samples), and computed the missed-edge rate (bottom left) and
misclassification rate (bottom right) as a function of the number of
samples. Given approximately 50 samples, the incoherent signal-subgraph
estimator correctly identifies about 40% of the edges, whereas the
coherent signal-subgraph estimator correctly identifies about 50%. This
suggests that even if the model were true (which we doubt) we are
justified to believe that only about half the edges in the estimated
signal-subgraph are in the actual signal-subgraph. Despite our stated
desideratum of interpretability of the resulting classifier in terms of
correctly identifying the signal-edges and vertices, for data sampled
from this assumed distribution, sample sizes of < 50 seem to be
insufficient. That said, both missed-edge rate and misclassification
rate exhibit a step-like function in performance: after about 50
samples, performance dramatically improves. This suggests that perhaps
only a few more data points would be necessary to obtain greatly
improved classification accuracy.

[Figure 6 panels: incoherent estimator; coherent estimator; coherent
estimator with assumed m = 12; zoomed-in coherent estimator
($\hat{L}_{coh} = 0.16$); coherent signal-subgraph estimate;
coherogram. Axes: (log) assumed # of signal edges; threshold.]

Fig. 6. MR connectome sex signal-subgraph estimation and analysis. By
cross-validating over hyperparameters and models, we estimate that the
"best" incoherent signal-subgraph (for this inference task on these
data) has $\hat{s}_{inc} = 10$ and yields a misclassification rate of
$\hat{L}_{inc} = 0.27$, whereas the best coherent signal-subgraph has
$\hat{m}_{coh} = 12$ and $\hat{s}_{coh} = 360$, achieving
$\hat{L}_{coh} = 0.16$. The top two panels depict the same information
as Figure 5. The middle two depict misclassification rate (left) for
$m' = 12$ as a function of $s'$ and (right) a zoomed-in depiction of
the top right panel. The bottom left panel shows the estimated
signal-subgraph, and the bottom right shows the coherogram. Together,
these bottom panels suggest that the signal-subgraph for these data is
at least somewhat coherent.

5.1.2 Model Checking

The assumption of independence between edges is (i) very useful for
algorithms and analysis, and (ii) almost certainly nonsense for real
connectome data. Checking whether edges are independent is relatively
easy. Figure 8 shows the correlation coefficient between all pairs of
edges in the estimated signal-subgraph from the neurobiological data.
We used a spectral clustering algorithm [30] to more clearly highlight
any significant correlations.

Fig. 7. Synthetic data analysis provides some intuition for model checking and future improvements. The top two
panels show the incoherent (left) and coherent (right) misclassification rates as a function of the hyper-parameter
choices for n = 49. These plots look quite similar to those obtained in the real connectome data (Figure 6),
which suggests that the chosen model may be adequate. The bottom panels show the missed-edge rate (left)
and misclassification rate (right) as a function of the number of training samples. With about 50 training samples,
approximately half of the edges identified by each classifier are true edges. Additionally, slightly more than 50 training
samples seems to be sufficient for obtaining nearly perfect classification, suggesting that perhaps only a few more
subjects would be sufficient to yield much greater classification performance.



Several groups of edges seem to be highly correlated. To assess
significance, we compare the distribution of correlation coefficients
with the distribution of correlation coefficients obtained from the
synthetic data analysis. A two-sample Kolmogorov-Smirnov test shows
that the two matrices are significantly different (p-value
$\approx 0$), rejecting the null hypothesis that the edges in the real
data are independent. This analysis further corroborates that making
independence assumptions can be fruitful even when the data are
dependent [19].
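
The two ingredients of this check, pairwise edge correlations and the
two-sample Kolmogorov-Smirnov statistic, can be sketched as follows.
Only the statistic is computed (no p-value); the function names and the
convention of returning 0 for a constant edge are assumptions made for
this example.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / math.sqrt(va * vb)


def ks_statistic(x, y):
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    i = j = 0
    d = 0.0
    while i < len(xs) and j < len(ys):
        v = min(xs[i], ys[j])
        # Advance both samples past every observation equal to v.
        while i < len(xs) and xs[i] <= v:
            i += 1
        while j < len(ys) and ys[j] <= v:
            j += 1
        d = max(d, abs(i / len(xs) - j / len(ys)))
    return d


print(pearson([0, 1, 0, 1], [0, 1, 0, 1]))
print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))
```

Here `x` and `y` would hold the off-diagonal correlation coefficients
from the real and synthetic data, respectively.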

6 Discussion 

This work makes the following contributions. First, it introduces a
novel graph/class model that admits rigorous statistical investigation.
Second, it presents two approaches for estimating the signal-subgraph:
the first using only vertex label information, the second also
utilizing graph structure. The resulting estimators have desirable
asymptotic and finite sample properties, including consistency and
robustness to various model misspecifications. Third, simulated data
analysis indicates that neither approach dominates the other; rather,
the best approach is a function of both the model and the amount of
training data. And while the lasso classifier has similar error
properties to our incoherent classifier, lasso's computational time is
about an order of magnitude longer. Fourth, these classifiers are
applied to an MR connectome sex classification data set; the coherent
classifier performs significantly better than a variety of benchmark
classifiers. Fifth, synthetic data analysis suggests that while we can
use the signal-subgraph estimators to improve classification
performance, we should not expect that all the edges in the estimated
signal-subgraph will be the true signal-edges, even when the model is
correct. Moreover, we might expect a drastic improvement in
classification performance with only a few additional data samples.
Finally, model checking suggests that the independent edge assumption
does not fit the data well.

Fig. 8. The correlation matrix between all the edges in the coherent
signal-subgraph estimate. Edges are organized by co-clustering to
highlight any similarities. Although most edges are uncorrelated,
several groups of edges cluster, indicative of the fact that the edges
are not independent (p-value $\approx 0$ using a two-sample
Kolmogorov-Smirnov test comparing the real and synthetic correlation
matrices).

Our signal-subgraph classifiers represent somewhat of a departure from
previous work. Most graph classification algorithms come from the
"structural pattern recognition" school of thought [1], lacking an
explicit statistical model and associated provable properties. On the
other hand, most work on "statistical pattern recognition" begins by
assuming the data to be classified are Euclidean vectors [31]. Our work
is a unification of the two. Moreover, because the sufficient
statistics are essentially encoded in a matrix, our work can be related
to recent developments in matrix decompositions. For example, sparse
and low-rank matrix decompositions are close in spirit to our coherent
signal-subgraph estimators [32]-[34]. Note, however, that our coherent
estimator is robust to signal-vertices having a subset of their edges
highly non-significant; that is, the coherent signal-subgraph estimator
can be thought of as a local sparse and low-rank decomposition.

Collectively, the above analyses suggest a number of possible next
steps. First, collect more data. Second, relax various assumptions,
including (i) the independent edge assumption by considering
conditionally independent edges [35]-[37], (ii) binary edge and class
assumptions, and (iii) labeled vertices assumption. Specifically,
extensions to situations for which none of the vertices are labeled
[38], [39], only some subset of vertices are labeled [40], [41], or
data are otherwise errorfully observed [42], are all avenues of future
investigation. Third, transform a number of conjectures that have
arisen due to these results into theorems. For instance, perhaps the
misclassification rate is a monotonic function of the missed-edge rate.
Fourth, (Bayesian) model-averaging to combine estimated
signal-subgraphs instead of picking one might improve performance
(perhaps at the cost of computational resources and interpretability).

We hope the proposed approaches will yield many applications. To that
end, all the data and code used in this work are available from the
author's website, http://jovo.me.